Research Report


Reliability and the Nonequivalent Groups With Anchor Test Design


Tim Moses 
Sooyeon Kim 



April 2007 
RR-07-16 





Tim Moses and Sooyeon Kim 
ETS, Princeton, NJ 





As part of its educational and social mission and in fulfilling the organization's nonprofit charter 
and bylaws, ETS has and continues to learn from and also to lead research that furthers 
educational and measurement research to advance quality and equity in education and assessment 
for all users of the organization's products and services. 

ETS Research Reports provide preliminary and limited dissemination of ETS research prior to 
publication. To obtain a PDF or a print copy of a report, please visit: 

http://www.ets.org/research/contact.html 


Copyright © 2007 by Educational Testing Service. All rights reserved. 

ETS and the ETS logo are registered trademarks of 
Educational Testing Service (ETS). 







Abstract 


This study evaluated the impact of unequal reliability on test equating methods in the 
nonequivalent groups with anchor test (NEAT) design. Classical true score-based models were 
compared in terms of their assumptions about how reliability impacts test scores. These models
were related to treatment of population ability differences by different NEAT equating methods. 
A score model was then developed based on the most important features of the reviewed score 
models and used to study reliability in a simulation study across a total of 45 measurement 
conditions (= 5 test and anchor reliability combinations x 3 population ability difference 
conditions x 3 sample sizes). Ten equating methods were considered: chained linear, chained 
equipercentile with raw and smoothed frequencies, Tucker, frequency estimation equipercentile 
with raw and smoothed frequencies, Levine observed using Angoff-estimated and the “correct” 
reliabilities based on the data generation model used in this study, and Levine true using Angoff- 
estimated and correct reliabilities. The results were consistent with what is known about equating 
functions and their variability. Unequal and/or low reliability inflates equating function 
variability and alters equating functions when population abilities differ. 

Key words: Reliability, NEAT equating design, classical true-score model, classical congeneric 
model, generalizability theory model, chained methods, conditioning methods, Levine equating 
method 
Introduction


Reliability is often regarded as an important aspect of acceptable test equating. One of the 
basic requirements of test equating is that the test forms to be equated be equally reliable (Allen 
& Yen, 1979; Angoff, 1971; Dorans & Holland, 2000; Kolen & Brennan, 2004; Lord, 1980; 
Petersen, Kolen, & Hoover, 1989). High reliability is also desirable, though not usually described 
as a specific requirement of equating. 

The focus of this paper is the impact of reliability on equating for the nonequivalent 
groups with anchor test (NEAT) design. Reliability has particularly important implications for 
NEAT equating, where the objective is to separately identify the contributions of examinee 
ability and test form difficulty on test scores in order to adjust test scores for form difficulty
differences. Reliability is first described in terms of its assumed role in different test score 
models. Next, the score models are described in terms of which equating model appropriately 
accounts for the score models’ population ability differences. Finally, reliability’s effects on 
equating functions are illustrated in a series of simulations. 

The Impact of Reliability on Scores 

There are many models of test scores, and reliability is given different roles in each 
model. This section compares two classical true score-based score models in terms of the roles
they assign to reliability. The simplifying assumption that the tests’ and anchors’ true scores are 
perfectly related is made throughout this discussion and the paper. 

Classical True Score Theory 

In classical true score theory, observed scores are modeled as the sum of “truth” and 
“error,” 


$$X = T + E. \tag{1}$$

In (1), the expected score equals the true score, $\varepsilon(X) = \mu_X = T$, and T and E are
independent so that their covariance, $\sigma(T,E)$, is zero. The independence assumption allows
observed score variance to be expressed as the sum of true score and error variance,

$$\sigma^2(X) = \sigma^2(T) + \sigma^2(E). \tag{2}$$

Reliability is defined as the ratio of true score variance to observed score variance,

$$\mathrm{rel}_X = \frac{\sigma^2(T)}{\sigma^2(T) + \sigma^2(E)}. \tag{3}$$
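As a small numerical illustration of (2) and (3), the sketch below computes reliability from true score and error variances and inverts the relation to split an observed variance into its two components. The function names and the numbers are illustrative assumptions, not part of the report.

```python
def reliability(var_true, var_error):
    """Equation (3): true score variance divided by observed score variance."""
    return var_true / (var_true + var_error)


def split_observed_variance(var_observed, rel):
    """Invert (2) and (3): return (true score variance, error variance)."""
    var_true = rel * var_observed
    return var_true, var_observed - var_true


# Illustrative values: an observed SD of 18 and a reliability of .9.
var_true, var_error = split_observed_variance(var_observed=18.0 ** 2, rel=0.9)
print(var_true, var_error, reliability(var_true, var_error))
```

Splitting the variance this way is one simple route to holding the observed score variance fixed while reliability changes.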


The two models reviewed next retain the main characteristics of classical true score theory
($\varepsilon(X) = T$ and $\sigma(T,E) = 0$). The unique and reliability-relevant aspects of these
models are in how they structure T and E to specify test form difficulty and examinee ability
effects (Table 1).

Table 1

Summary of Score Models and the Roles These Models Give to Reliability and Examinee
Ability

Model                         Test score = [Test difficulty] + [Examinee ability] + [Error (unreliability)]

Classical congeneric model    $X = [\delta_X] + [\lambda_X T] + [E_X]$

Generalizability theory       $X_{t\cdot} = [\sum_i v_i] + [n_i t] + [\sum_i v_{ti}]$

Classical Congeneric Models 

Classical congeneric models (Brennan, 1990; Feldt & Brennan, 1989) specify the
contributions of difficulty, ability, and reliability on congeneric test and anchor scores. Tests (X)
and anchors (A) are modeled as

$$X = (T_X) + (E_X) = (\lambda_X T + \delta_X) + (E_X),$$

$$A = (T_A) + (E_A) = (\lambda_A T + \delta_A) + (E_A). \tag{4}$$

The $\lambda$ terms are defined as effective test lengths (Brennan, 1990). This paper’s
discussion defines effective test length as a reliability-dependent true score standard deviation
(Angoff, 1953, 1971, pp. 114-115; Kolen & Brennan, 2004, pp. 112-113),

$$\sigma(T_X) = \sqrt{\mathrm{rel}_X}\,\sigma(X)
= \sqrt{\frac{\lambda_X^2\sigma^2(T)}{\lambda_X^2\sigma^2(T) + \sigma^2(E_X)}}\;\sqrt{\lambda_X^2\sigma^2(T) + \sigma^2(E_X)}
= \lambda_X\,\sigma(T),$$

so that $\lambda_X = \sqrt{\mathrm{rel}_X}\,\frac{\sigma(X)}{\sigma(T)}$. T is the test taker true score
and underlies the tests’ and anchors’ true scores. The test and anchor are congeneric, meaning that
their true scores, $T_X = \sqrt{\mathrm{rel}_X}\,\frac{\sigma(X)}{\sigma(T)}\,T + \delta_X$ and
$T_A = \sqrt{\mathrm{rel}_A}\,\frac{\sigma(A)}{\sigma(T)}\,T + \delta_A$, are perfectly related. The
$\delta$ terms in (4) are constants that determine the difficulty or ease of the test and anchor.
The E terms have expectations of zero and are independent of T. Test and anchor reliabilities in
the classical congeneric model are


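$$\mathrm{rel}_X = \frac{\lambda_X^2\sigma^2(T)}{\lambda_X^2\sigma^2(T) + \sigma^2(E_X)}
= \frac{\mathrm{rel}_X\frac{\sigma^2(X)}{\sigma^2(T)}\sigma^2(T)}{\mathrm{rel}_X\frac{\sigma^2(X)}{\sigma^2(T)}\sigma^2(T) + \sigma^2(E_X)}
= \frac{\mathrm{rel}_X\sigma^2(X)}{\mathrm{rel}_X\sigma^2(X) + \sigma^2(E_X)}
= \frac{\sigma^2(T_X)}{\sigma^2(T_X) + \sigma^2(E_X)},$$

$$\mathrm{rel}_A = \frac{\lambda_A^2\sigma^2(T)}{\lambda_A^2\sigma^2(T) + \sigma^2(E_A)}
= \frac{\mathrm{rel}_A\frac{\sigma^2(A)}{\sigma^2(T)}\sigma^2(T)}{\mathrm{rel}_A\frac{\sigma^2(A)}{\sigma^2(T)}\sigma^2(T) + \sigma^2(E_A)}
= \frac{\mathrm{rel}_A\sigma^2(A)}{\mathrm{rel}_A\sigma^2(A) + \sigma^2(E_A)}
= \frac{\sigma^2(T_A)}{\sigma^2(T_A) + \sigma^2(E_A)}.$$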




The classical part of classical congeneric theory is the assumption that error variances are
proportional to effective test length:

$$\sigma^2(E_X) = \lambda_X\,\sigma^2(E) = \sqrt{\mathrm{rel}_X}\,\frac{\sigma(X)}{\sigma(T)}\,\sigma^2(E),$$

$$\sigma^2(E_A) = \lambda_A\,\sigma^2(E) = \sqrt{\mathrm{rel}_A}\,\frac{\sigma(A)}{\sigma(T)}\,\sigma^2(E). \tag{6}$$


The classical congeneric model has two important implications for scores and equating.
First, the proportionality of the error variances across the tests and anchors in (6) allows
reliability and the test and anchor error variances to be estimated from the observed variances
and correlations of the test and anchor scores (Angoff, 1953). Second, while the mean observed
scores ($\mu_X$) are equal to mean true scores ($\mu_{T_X}$), the population ability effect
($\mu_T$) on mean observed scores is directly influenced by reliability,

$$\mu_X = \mu_{T_X} = \sqrt{\mathrm{rel}_X}\,\frac{\sigma(X)}{\sigma(T)}\,\mu_T + \delta_X,$$

$$\mu_A = \mu_{T_A} = \sqrt{\mathrm{rel}_A}\,\frac{\sigma(A)}{\sigma(T)}\,\mu_T + \delta_A. \tag{7}$$


For classical congeneric models, the role of unreliability on test scores is not only to 
influence the proportion of true score to observed score variance. Unreliability also biases the 
extent to which overall test taker abilities are visible on observed scores. 
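The dependence of mean observed scores on reliability in (7) can be made concrete with a short sketch. It assumes, purely for illustration, an observed score SD of 18, a true ability SD of 1, a difficulty constant of 50, and two groups whose mean abilities are +0.15 and -0.15; none of these values or function names come from the report.

```python
import math


def effective_length(rel, sd_observed, sd_true):
    """lambda = sqrt(rel) * sigma(X) / sigma(T), the effective test length."""
    return math.sqrt(rel) * sd_observed / sd_true


def congeneric_mean(rel, sd_observed, sd_true, mean_true, delta):
    """Equation (7): mu_X = lambda * mu_T + delta."""
    return effective_length(rel, sd_observed, sd_true) * mean_true + delta


# Hypothetical groups: mean abilities of +0.15 and -0.15 on a unit-SD ability scale.
differences = {}
for rel in (0.9, 0.7, 0.5):
    high = congeneric_mean(rel, 18.0, 1.0, 0.15, 50.0)
    low = congeneric_mean(rel, 18.0, 1.0, -0.15, 50.0)
    differences[rel] = high - low
    print(rel, round(differences[rel], 3))
```

Under these assumptions the same true ability difference produces a smaller observed mean difference as reliability falls, which is the sense in which unreliability hides ability differences in the classical congeneric model.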


Generalizability Theory 

Generalizability theory extends classical true score theory by using analysis of variance
models to separately identify different sources of the error left undifferentiated by classical true
score theory (Brennan, 2001; Cronbach, Gleser, Nanda, & Rajaratnam, 1972). One of the
simplest designs is sufficient for identifying the role of reliability on scores of test takers (t)
sampled from some population of interest taking items (i) sampled from a universe of admissible
items. The score on item i for test taker t ($X_{ti}$) is modeled as

$$X_{ti} = t + v_i + v_{ti}. \tag{8}$$

The t reflects test taker t’s ability and has expectation
$\varepsilon(t) = \mu_t = \varepsilon_t\varepsilon_i(X_{ti})$. The effect $v_i$ is
interpretable as the influence of an easier or more difficult item, which introduces absolute error
when decisions are made based on the absolute values of observed scores (e.g., classifications
with respect to a cut-score). The effect $v_{ti}$ is the interaction of test takers with items that is
confounded with all other sources of error, which introduces relative error for decisions based on
the relative standing of test takers on their observed scores.

The error effects have zero expectations,

$$\varepsilon_i(v_i) = \varepsilon_t(v_{ti}) = \varepsilon_i(v_{ti}) = 0, \tag{9}$$

and are assumed to be uncorrelated with the other terms in the model and the effects of other
items ($i' \neq i$),

$$\sigma(t\,v_{ti}) = \sigma(v_i v_{ti}) = \sigma(t\,v_i) = \sigma(v_i v_{i'}) = \sigma(v_{ti} v_{ti'}) = 0. \tag{10}$$
To emphasize how the generalizability theory model in (8) relates to the classical
congeneric model in (4), (8) can be used to express test takers’ scores (i.e., the sums of their $n_i$
item scores) on tests ($X_{t\cdot}$) and anchors ($A_{t\cdot}$) that are based on items (not necessarily
the same) sampled from a common universe:

$$X_{t\cdot} = \sum_{i,X}^{n_{i,X}} X_{ti} = \sum_{i,X}^{n_{i,X}} (t + v_{i,X} + v_{ti,X})
= n_{i,X}(t) + \sum_{i,X}^{n_{i,X}} v_{i,X} + \sum_{i,X}^{n_{i,X}} v_{ti,X},$$

$$A_{t\cdot} = \sum_{i,A}^{n_{i,A}} A_{ti} = \sum_{i,A}^{n_{i,A}} (t + v_{i,A} + v_{ti,A})
= n_{i,A}(t) + \sum_{i,A}^{n_{i,A}} v_{i,A} + \sum_{i,A}^{n_{i,A}} v_{ti,A}. \tag{11}$$

From (11), examinees take both X and A, so that test taker variance (i.e., true score
variance) contributes to the observed variances of both X and A. Because the items on X and A
are parallel, item effects for items in X ($v_{i,X}$ and $v_{ti,X}$) and A ($v_{i,A}$ and $v_{ti,A}$) both
have variances $\sigma^2(i)$ and $\sigma^2(ti)$. Observed score variances for the scores in (11) are
therefore defined as

$$\sigma^2(X_{t\cdot}) = n_{i,X}^2\left(\sigma^2(t) + \frac{\sigma^2(i)}{n_{i,X}} + \frac{\sigma^2(ti)}{n_{i,X}}\right),$$

$$\sigma^2(A_{t\cdot}) = n_{i,A}^2\left(\sigma^2(t) + \frac{\sigma^2(i)}{n_{i,A}} + \frac{\sigma^2(ti)}{n_{i,A}}\right). \tag{12}$$


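Test and anchor reliabilities are defined as

$$\mathrm{rel}_X = \frac{\sigma^2(t)}{\sigma^2(t) + \dfrac{\sigma^2(ti)}{n_{i,X}}}, \qquad
\mathrm{rel}_A = \frac{\sigma^2(t)}{\sigma^2(t) + \dfrac{\sigma^2(ti)}{n_{i,A}}}. \tag{13}$$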


The reliability coefficients in (13) are identical to coefficient alpha and, for dichotomously scored
items, the KR-20 reliability coefficient. Classical definitions of reliability focus
exclusively on relative error ($v_{ti}$) rather than absolute error ($v_i$) for defining error variance.
Reliabilities across congeneric tests and anchors differ only with respect to test and anchor
lengths, a fundamental assumption that follows from the test taker population and item universe
that is used in the decision studies that typically accompany generalizability analyses.
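The role of test length in (12) and (13) can be sketched numerically. The variance components below are hypothetical values chosen only to show the pattern, and the function names are assumptions.

```python
def observed_variance(n_items, var_t, var_i, var_ti):
    """Equation (12): n^2 * (var_t + var_i / n + var_ti / n)."""
    return n_items ** 2 * (var_t + var_i / n_items + var_ti / n_items)


def g_reliability(n_items, var_t, var_ti):
    """Equation (13): only relative error (var_ti) enters the error term."""
    return var_t / (var_t + var_ti / n_items)


# Hypothetical variance components for test takers, items, and their interaction.
var_t, var_i, var_ti = 0.02, 0.01, 0.20
for n_items in (30, 100):
    print(n_items,
          round(observed_variance(n_items, var_t, var_i, var_ti), 2),
          round(g_reliability(n_items, var_t, var_ti), 3))
```

With the variance components held fixed, the 100-item form is more reliable than the 30-item form, which is the only way reliabilities can differ across forms built from the same item universe in this model.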

Comparing the Classical Congeneric and Generalizability Theory Models 

Important distinctions exist between the classical congeneric and generalizability theory
models that are relevant for equating. First, the source of test form difficulty differences is
different for each model. For generalizability theory models, test form difficulty differences are
assumed to be due to random samples of items that do not always have mean difficulties that
converge to their expected value of zero (i.e., while $\varepsilon_i(v_i) = 0$, $\sum_i^{n_i} v_i$ may
not equal zero for every sample of items and therefore $\sigma^2(i)$ is not necessarily equal to
zero). In classical congeneric models, test form difficulty differences are described as systematic
difficulty effects rather than as sampling effects (i.e., the $\delta$ terms are defined as constants
rather than as random variables). In addition, reliability across tests and anchors composed of
parallel items in generalizability models does not directly affect the scores’ extent of ability
effects (changes in $\sigma^2(i)$ and $\sigma^2(ti)$ do not necessarily affect
$\varepsilon_t(X_{t\cdot}) = n_i\mu_t$), while in classical congeneric models, reliability has a
direct effect on a score’s extent of ability effects
($\mu_X = \mu_{T_X} = \sqrt{\mathrm{rel}_X}\,\frac{\sigma(X)}{\sigma(T)}\,\mu_T + \delta_X$).

The Impact of Reliability on Equating 

The purpose of equating is to adjust the scores of test forms that are intended to be
parallel for unintended differences in difficulty. NEAT equating matches nonequivalent
administration groups on their ability, where ability differences are estimated from mean anchor
score differences. When one of two test forms (X or Y) is given to an independent sample of one
of two populations (P or Q) along with an anchor test (A), $X_P$ is equated to $Y_Q$, and the
anchor scores ($A_P$ and $A_Q$) are used to account for ability differences in the populations. The
following presentation focuses on contrasting the major equating methods’ treatment of
population ability differences when scores are unreliable and follow either the classical
congeneric model or the generalizability theory model. Some previous works that compared the
equating methods directly informed this section (Holland, 2004; Kolen & Brennan, 2004),
and other works have informed this section’s relating of the classical congeneric model to Levine
equating (Brennan, 1990; Hanson, 1991).

Four common linear equating methods in the NEAT design (Tucker, chained linear,
Levine observed, and Levine true) all incorporate ability differences between populations P and
Q in the $X_P$-to-$Y_Q$ equating function. Ability difference information is expressed as

$$\gamma_{eY}\,\frac{\sigma(Y_Q)}{\sigma(A_Q)}\,(\mu_{AP} - \mu_{AQ}). \tag{14}$$

Here, (14) shows that the population ability difference observed in P and Q’s anchor means is
standardized according to $A_Q$’s variability and scaled to $Y_Q$’s variability. The $\gamma_{eY}$ is
a term that is specific to each equating method and describes the unique way an equating method
scales mean anchor score differences to $Y_Q$.

Here, (14) is part of the chained linear equating function,

$$cl_Y(X_P) = \frac{\sigma(Y_Q)\,\sigma(A_P)}{\sigma(A_Q)\,\sigma(X_P)}\,(X_P - \mu_{XP}) + \mu_{YQ}
+ \frac{\sigma(Y_Q)}{\sigma(A_Q)}\,(\mu_{AP} - \mu_{AQ}), \tag{15}$$


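and the Levine true equating function,

$$lt_Y(X_P) = \frac{\sqrt{\mathrm{rel}_{YQ}}\,\sigma(Y_Q)\,\sqrt{\mathrm{rel}_{AP}}\,\sigma(A_P)}{\sqrt{\mathrm{rel}_{AQ}}\,\sigma(A_Q)\,\sqrt{\mathrm{rel}_{XP}}\,\sigma(X_P)}\,(X_P - \mu_{XP}) + \mu_{YQ}
+ \frac{\sqrt{\mathrm{rel}_{YQ}}\,\sigma(Y_Q)}{\sqrt{\mathrm{rel}_{AQ}}\,\sigma(A_Q)}\,(\mu_{AP} - \mu_{AQ}). \tag{16}$$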


Also, (14) is used to estimate the mean of the unobserved $Y_P$ for the Levine observed method,

$$\mu_{YP} = \mu_{YQ} + \frac{\sqrt{\mathrm{rel}_{YQ}}\,\sigma(Y_Q)}{\sqrt{\mathrm{rel}_{AQ}}\,\sigma(A_Q)}\,(\mu_{AP} - \mu_{AQ}). \tag{17}$$


Finally, (14) is used to estimate the mean of the unobserved $Y_{wP+(1-w)Q}$ for the Tucker method,
expressed here with a $Y_Q A_Q$ correlation ($\rho_{Y_Q A_Q}$) based on the assumption of congeneric
tests and anchors ($\rho_{Y_Q A_Q} = \sqrt{\mathrm{rel}_{YQ}\,\mathrm{rel}_{AQ}}$),

$$\mu_{Y_{wP+(1-w)Q}} = \mu_{YQ} + w\,\sqrt{\mathrm{rel}_{YQ}\,\mathrm{rel}_{AQ}}\;\frac{\sigma(Y_Q)}{\sigma(A_Q)}\,(\mu_{AP} - \mu_{AQ}). \tag{18}$$
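The method-specific scaling of the anchor mean difference in (14) through (18) can be compared side by side. The sketch below is only an illustration: the function and method names are hypothetical, the Tucker weight w is set to 1 (a synthetic population made up entirely of P), and the reliabilities of .9 and .6 are example values.

```python
import math


def gamma_factor(method, rel_yq, rel_aq, w=1.0):
    """Method-specific multiplier applied to the anchor mean difference in (14)."""
    if method == "chained_linear":
        return 1.0
    if method in ("levine_observed", "levine_true"):
        return math.sqrt(rel_yq) / math.sqrt(rel_aq)
    if method == "tucker":
        return w * math.sqrt(rel_yq * rel_aq)
    raise ValueError(f"unknown method: {method}")


def ability_adjustment(method, sd_yq, sd_aq, mean_ap, mean_aq, rel_yq, rel_aq, w=1.0):
    """Equation (14): gamma * sigma(Y_Q) / sigma(A_Q) * (mu_AP - mu_AQ)."""
    gamma = gamma_factor(method, rel_yq, rel_aq, w)
    return gamma * sd_yq / sd_aq * (mean_ap - mean_aq)


# Example inputs: test SD 18, anchor SD 5.4, anchor means 15.81 (P) and 14.19 (Q).
adjustments = {
    method: ability_adjustment(method, 18.0, 5.4, 15.81, 14.19, rel_yq=0.9, rel_aq=0.6)
    for method in ("tucker", "chained_linear", "levine_true")
}
for method, value in adjustments.items():
    print(method, round(value, 3))
```

With a test that is more reliable than its anchor, the Tucker adjustment is the smallest, the chained linear adjustment is in the middle, and the Levine adjustment is the largest.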


Scaling Population Ability Differences in Terms of Observed Scores 

The chained linear method utilizes an observed score scaling of the population ability
differences in (14) by setting $\gamma_{eY} = 1$ in (15). The chained method’s use of observed score
variance rather than true score variance has the advantage of simplicity and of dealing with
directly observable variances. When data follow a generalizability theory model, the observed
score scaling is defensible because the observed score means are equal to average true ability
($\varepsilon_t\varepsilon_i(X_{ti}) = \mu_t$ in (8)). Observed score scaling of the difference in mean
$A_P$ ($\mu_{AP} = n_{iA}\mu_{tP}$) and mean $A_Q$ ($\mu_{AQ} = n_{iA}\mu_{tQ}$) can be expressed as

$$\frac{\sigma(Y_Q)}{\sigma(A_Q)}\,(\mu_{AP} - \mu_{AQ}) =
\frac{n_{iY}\sqrt{\sigma^2(t_Q) + \dfrac{\sigma^2(i_Y)}{n_{iY}} + \dfrac{\sigma^2(t_Q i_Y)}{n_{iY}}}}
{n_{iA}\sqrt{\sigma^2(t_Q) + \dfrac{\sigma^2(i_A)}{n_{iA}} + \dfrac{\sigma^2(t_Q i_A)}{n_{iA}}}}\,
(n_{iA}\mu_{tP} - n_{iA}\mu_{tQ}). \tag{19}$$


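When data follow a classical congeneric model, the chained linear method’s $\gamma_{eY} = 1$ results in

$$\frac{\sigma(Y_Q)}{\sigma(A_Q)}\,(\mu_{AP} - \mu_{AQ})
= \frac{\sigma(Y_Q)}{\sigma(A_Q)}\left(\sqrt{\mathrm{rel}_{AP}}\,\frac{\sigma(A_P)}{\sigma(T_P)}\,\mu_{TP} + \delta_A
- \sqrt{\mathrm{rel}_{AQ}}\,\frac{\sigma(A_Q)}{\sigma(T_Q)}\,\mu_{TQ} - \delta_A\right)$$

$$= \frac{\sigma(Y_Q)}{\sigma(A_Q)}\left(\sqrt{\mathrm{rel}_{AP}}\,\frac{\sigma(A_P)}{\sigma(T_P)}\,\mu_{TP}
- \sqrt{\mathrm{rel}_{AQ}}\,\frac{\sigma(A_Q)}{\sigma(T_Q)}\,\mu_{TQ}\right). \tag{20}$$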


If the anchor scores are very unreliable in a classical congeneric model, the chained linear method’s
observed score scaling of ability differences will reflect a biased estimate of true population
ability differences,

$$\sqrt{\mathrm{rel}_{AP}}\,\frac{\sigma(A_P)}{\sigma(T_P)}\,\mu_{TP} - \sqrt{\mathrm{rel}_{AQ}}\,\frac{\sigma(A_Q)}{\sigma(T_Q)}\,\mu_{TQ}
\;<\; \frac{\sigma(A_P)}{\sigma(T_P)}\,\mu_{TP} - \frac{\sigma(A_Q)}{\sigma(T_Q)}\,\mu_{TQ}.$$
Scaling Population Ability Differences in Terms of True Scores

The Levine true and observed methods utilize a true score scaling of the population
ability differences in (14), meaning that they set $\gamma_{eY}$ equal to the ratio of $Y_Q$’s and
$A_Q$’s root reliabilities, $\frac{\sqrt{\mathrm{rel}_{YQ}}}{\sqrt{\mathrm{rel}_{AQ}}}$, in (16) and (17).
The true score scaling of ability differences is especially defensible when the data follow a
classical congeneric model. True score scaling of ability differences can be justified when the
data follow a generalizability theory model under limited conditions.

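When the data follow a classical congeneric model, true score scaling in (14) results in

$$\frac{\sqrt{\mathrm{rel}_{YQ}}}{\sqrt{\mathrm{rel}_{AQ}}}\,\frac{\sigma(Y_Q)}{\sigma(A_Q)}\,(\mu_{AP} - \mu_{AQ})
= \frac{\sqrt{\mathrm{rel}_{YQ}}\,\sigma(Y_Q)}{\sqrt{\mathrm{rel}_{AQ}}\,\sigma(A_Q)}
\left(\sqrt{\mathrm{rel}_{AP}}\,\frac{\sigma(A_P)}{\sigma(T_P)}\,\mu_{TP}
- \sqrt{\mathrm{rel}_{AQ}}\,\frac{\sigma(A_Q)}{\sigma(T_Q)}\,\mu_{TQ}\right). \tag{21}$$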


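When $\mathrm{rel}_{AP} = \mathrm{rel}_{AQ}$ and $\sigma(A_P) = \sigma(A_Q)$, the anchor terms in (21)
cancel, and (21) can be written directly in terms of $Y_Q$’s true score variance as

$$\sqrt{\mathrm{rel}_{YQ}}\,\sigma(Y_Q)\left(\frac{\mu_{TP}}{\sigma(T_P)} - \frac{\mu_{TQ}}{\sigma(T_Q)}\right)
= \sigma(T_{YQ})\left(\frac{\mu_{TP}}{\sigma(T_P)} - \frac{\mu_{TQ}}{\sigma(T_Q)}\right). \tag{22}$$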

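When the data follow a generalizability theory model, true score standard deviations are
the product of test length and the test taker variance component ($n_{iY}\sigma(t_Q)$ and
$n_{iA}\sigma(t_Q)$, as in (12)), so that true score scaling essentially involves the lengths of
$A_Q$ ($n_{iA}$) and $Y_Q$ ($n_{iY}$),

$$\frac{\sqrt{\mathrm{rel}_{YQ}}}{\sqrt{\mathrm{rel}_{AQ}}}\,\frac{\sigma(Y_Q)}{\sigma(A_Q)}\,(\mu_{AP} - \mu_{AQ}) =
\frac{\sqrt{\dfrac{\sigma^2(t_Q)}{\sigma^2(t_Q) + \frac{\sigma^2(t_Q i)}{n_{iY}}}}}
{\sqrt{\dfrac{\sigma^2(t_Q)}{\sigma^2(t_Q) + \frac{\sigma^2(t_Q i)}{n_{iA}}}}}\;
\frac{n_{iY}\sqrt{\sigma^2(t_Q) + \dfrac{\sigma^2(i)}{n_{iY}} + \dfrac{\sigma^2(t_Q i)}{n_{iY}}}}
{n_{iA}\sqrt{\sigma^2(t_Q) + \dfrac{\sigma^2(i)}{n_{iA}} + \dfrac{\sigma^2(t_Q i)}{n_{iA}}}}\,
(n_{iA}\mu_{tP} - n_{iA}\mu_{tQ}). \tag{23}$$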


When $n_{iA}$ and $n_{iY}$ are large and/or $\sigma^2(i)$ is relatively small (i.e., forms are long
and/or items do not differ widely in difficulty),
$\sigma^2(t) + \frac{\sigma^2(i)}{n_i} + \frac{\sigma^2(ti)}{n_i} \approx \sigma^2(t) + \frac{\sigma^2(ti)}{n_i}$,
so that (23) becomes

$$\frac{n_{iY}\,\sigma(t_Q)}{n_{iA}\,\sigma(t_Q)}\,(n_{iA}\mu_{tP} - n_{iA}\mu_{tQ}).$$

When scaling according to true score variance in generalizability theory, the ability difference is
scaled using the ratio of the actual lengths of $Y_Q$ and $A_Q$ ($\frac{n_{iY}}{n_{iA}}$).
When scaling according to true score variance in classical congeneric theory, the ability
difference is scaled using the ratio of the effective lengths of $Y_Q$ and $A_Q$
($\frac{\sqrt{\mathrm{rel}_{YQ}}\,\sigma(Y_Q)}{\sqrt{\mathrm{rel}_{AQ}}\,\sigma(A_Q)}$). For both
score models, the average true score differences between the populations are potentially scalable
according to how each theory defines the true score scale of $Y_Q$.


Using Population Ability Differences in an Observed-Score Regression 

Tucker equating assumes that the linear regressions of observed test scores on observed 
anchor scores are test- and anchor-specific rather than population-dependent. Assumptions about 
true scores and errors do not directly inform the linear regression used by the Tucker method, 
though some correspondence to true score theory can be observed by noting that the correlation 
between congeneric tests and anchors is expressible in terms of test and anchor reliabilities. An
estimate of the mean of $Y_{wP+(1-w)Q}$, $\mu_{Y_{wP+(1-w)Q}}$, can be obtained, as in (18), by
applying the synthetic population-weighted $Y_Q|A_Q$ regression at score $\mu_{AP}$. When
unreliability weakens the $Y_Q|A_Q$ regression, it essentially discounts the extent to which anchor
score mean differences are
incorporated in the Tucker equating function. This discounting is very different from the way the 
chained linear and Levine methods utilize the mean ability difference information in (14), and it 
is inconsistent with how reliability is assumed to affect scores generated from classical 
congeneric and generalizability theory models. 


Unreliability and Equating Bias 

From the previous section’s discussion, the impact of unreliability on equating bias can
be understood as a misinterpretation of the anchor score information (i.e., using an equating
method that incorrectly scales $\mu_{AP} - \mu_{AQ}$), as in (14). For example, if reliability’s effect
was only on true score and error variances and not on ability effects, the Levine method would
assume that true ability differences were bigger than the observed $\mu_{AP} - \mu_{AQ}$ and
overmatch on the ability difference by setting
$\gamma_{eY} = \frac{\sqrt{\mathrm{rel}_{YQ}}}{\sqrt{\mathrm{rel}_{AQ}}}$. In contrast to the Levine
method, the Tucker method would incorrectly discount the observed $\mu_{AP} - \mu_{AQ}$ and
undermatch for $\mu_{AP} - \mu_{AQ}$ by applying a regression and setting
$\gamma_{eY} = w\sqrt{\mathrm{rel}_{YQ}\,\mathrm{rel}_{AQ}}$.

If reliability’s effect was to bias the true ability difference information observed in
$\mu_{AP} - \mu_{AQ}$ (as in classical congeneric models), then the chained linear and Tucker
methods would undermatch on the true ability in $\mu_{AP} - \mu_{AQ}$. The Tucker method would
undermatch more so than the chained linear method because the former’s setting
$\gamma_{eY} = w\sqrt{\mathrm{rel}_{YQ}\,\mathrm{rel}_{AQ}}$ would incorrectly reduce the observed
ability difference more than the chained linear method’s setting $\gamma_{eY} = 1$. If the
observed test and anchor scores followed population-invariant linear regression models, Tucker
would correctly utilize the regressions, and the chained linear and Levine methods would
incorrectly overmatch on $\mu_{AP} - \mu_{AQ}$, the Levine method more so than the chained linear
one because the Levine method’s incorrect setting
$\gamma_{eY} = \frac{\sqrt{\mathrm{rel}_{YQ}}}{\sqrt{\mathrm{rel}_{AQ}}}$ will likely magnify
$\mu_{AP} - \mu_{AQ}$ while the chained linear method’s setting $\gamma_{eY} = 1$ will be closer to
the Tucker method’s $\gamma_{eY} = w\sqrt{\mathrm{rel}_{YQ}\,\mathrm{rel}_{AQ}}$.

Unreliability and Equating Standard Errors 

The impact of reliability on equating variability can also be understood in terms of the
different equating methods’ versions of (14). Specifically, $\mu_{AP} - \mu_{AQ}$ has a sampling
variance that will impact equating standard errors. The Tucker method’s tendency to downweight
$\mu_{AP} - \mu_{AQ}$ based on the test-anchor correlation would also reduce equating standard
errors. The Levine method’s tendency to magnify $\mu_{AP} - \mu_{AQ}$ based on the ratio of test and
anchor root reliabilities would magnify equating standard errors. These statements correspond to
previous findings of the relative ordering of equating function standard errors, where approaches
that use the anchor as a conditioning variable (Tucker and frequency estimation equipercentile)
are less variable than approaches that use the anchor to form a chain of links between two test
forms (chained linear and chained equipercentile), which are in turn less variable than the Levine
approaches (von Davier, Holland, & Thayer, 2004; von Davier & Kong, 2005; Kolen & Brennan, 2004;
Wang, Lee, Brennan, & Kolen, 2006). To the extent that unreliability makes equating functions more
variable, it should do so more for the Levine and chained approaches than for the conditioning
approaches.
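The ordering described above can be sketched by propagating only the sampling variance of the anchor mean difference through each method’s version of (14). This is a deliberately partial calculation: it assumes independent samples, treats the standard deviations and reliabilities as known constants, ignores every other source of sampling error in the equating function, and uses hypothetical function names and example values.

```python
import math


def adjustment_standard_error(gamma, sd_yq, sd_ap, sd_aq, n_p, n_q):
    """Standard error of gamma * sigma(Y_Q)/sigma(A_Q) * (mean A_P - mean A_Q)."""
    se_mean_difference = math.sqrt(sd_ap ** 2 / n_p + sd_aq ** 2 / n_q)
    return gamma * sd_yq / sd_aq * se_mean_difference


rel_yq, rel_aq, w = 0.9, 0.6, 1.0  # example reliabilities and Tucker weight
gammas = {
    "tucker": w * math.sqrt(rel_yq * rel_aq),
    "chained_linear": 1.0,
    "levine": math.sqrt(rel_yq) / math.sqrt(rel_aq),
}
standard_errors = {
    method: adjustment_standard_error(gamma, 18.0, 5.4, 5.4, 500, 500)
    for method, gamma in gammas.items()
}
for method, value in standard_errors.items():
    print(method, round(value, 3))
```

Because the three methods differ here only in the multiplier applied to the same noisy mean difference, the standard error of this term is ordered Tucker, then chained linear, then Levine.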


Method 

The issues of reliability on test equating results were explored in a simulation study. A 
data generation model was developed to reflect the following main features of classical true 
score theory, classical congeneric models, and generalizability theory applied to congeneric tests 
and anchors: 

• Observed scores are the sum of test-taker truth plus error. 

• The expected value of the observed scores is the true score. 

• The true scores are independent of error. 

• The correlation of the test and anchor true scores is 1. 

The data generation model was also developed in order to manipulate reliability
independently of other test score features, including observed score variances, true population
ability differences, observed lengths, and the absence of test form difficulty differences. This
conception of reliability was the basis for studying how equating method averages and standard
errors were impacted across combinations of sample size, reliability, equating method, and
population ability difference.

Data Generation Model 

The $X_P$, $A_P$, $A_Q$, and $Y_Q$ scores were generated as sums of independently and normally
distributed truth and error variables

$$X_P = T_{XP} + E_{XP},$$
$$A_P = T_{AP} + E_{AP},$$
$$A_Q = T_{AQ} + E_{AQ},$$
$$Y_Q = T_{YQ} + E_{YQ}. \tag{24}$$

In (24), the true scores in population P ($T_{XP}$ and $T_{AP}$) and in population Q ($T_{AQ}$ and
$T_{YQ}$) were perfectly correlated, differing only in their variances
($\sigma^2_{T_{XP}} \neq \sigma^2_{T_{AP}}$ and $\sigma^2_{T_{YQ}} \neq \sigma^2_{T_{AQ}}$). The mean
test and anchor scores were functions of actual lengths
($\varepsilon(X_P) = \varepsilon(T_{XP}) = n_{XP}\mu_P$, $\varepsilon(A_P) = \varepsilon(T_{AP}) = n_{AP}\mu_P$,
$\varepsilon(Y_Q) = \varepsilon(T_{YQ}) = n_{YQ}\mu_Q$, and
$\varepsilon(A_Q) = \varepsilon(T_{AQ}) = n_{AQ}\mu_Q$). The test lengths ($n_{XP}$ and $n_{YQ}$) were
set equal to 100 and the anchor lengths ($n_{AP}$ and $n_{AQ}$) were set equal to 30. The anchors
were external to the tests. The variances of the true scores and error scores in (24) were
manipulated to produce test and anchor scores of desired reliability levels (e.g.,
$\mathrm{rel}_{XP} = \frac{\sigma^2(T_{XP})}{\sigma^2(T_{XP}) + \sigma^2(E_{XP})}$, Equation 3), while
achieving desired observed score variances (e.g.,
$\sigma^2(X_P) = \sigma^2(T_{XP}) + \sigma^2(E_{XP})$, Equation 2). The observed score
variances were kept equal for the tests ($\sigma^2_{XP} = \sigma^2_{YQ}$) and the anchors
($\sigma^2_{AP} = \sigma^2_{AQ}$). To also consider equipercentile methods, the final scores were
rounded to integer units and truncated into desired score ranges. The means and variances of the
$X_P$, $A_P$, $A_Q$, and $Y_Q$ scores in (24) are summarized in Table 2.
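A minimal sketch of one way to generate scores with these properties follows. It is not the authors’ code: it assumes that a single standardized normal ability drives both true scores in a population (so that they correlate perfectly), that rounding uses Python’s built-in `round`, and that scores are truncated to the range from 0 to the test length. The function name, seed, and example condition are illustrative.

```python
import math
import random


def generate_group(n_examinees, mu, test_spec, anchor_spec, rng):
    """Generate (test, anchor) score pairs; each spec is (length, observed SD, reliability)."""
    pairs = []
    for _ in range(n_examinees):
        z = rng.gauss(0.0, 1.0)  # standardized true ability shared by test and anchor
        pair = []
        for length, sd_observed, rel in (test_spec, anchor_spec):
            true_score = length * mu + math.sqrt(rel) * sd_observed * z
            error = rng.gauss(0.0, math.sqrt(1.0 - rel) * sd_observed)
            score = round(true_score + error)          # integer score units
            pair.append(min(max(score, 0), length))    # truncate to the score range
        pairs.append(tuple(pair))
    return pairs


rng = random.Random(2007)
# Example: population P with mu = .5135, test reliability .9, anchor reliability .6.
group_p = generate_group(5000, 0.5135, (100, 18.0, 0.9), (30, 5.4, 0.6), rng)
mean_x = sum(x for x, _ in group_p) / len(group_p)
mean_a = sum(a for _, a in group_p) / len(group_p)
print(round(mean_x, 2), round(mean_a, 2))
```

Because the true score SD is sqrt(rel) times the observed SD and the error SD is sqrt(1 - rel) times the observed SD, the observed variance stays fixed while reliability varies, apart from small effects of rounding and truncation.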

There are several implications of the data generation model in (24): 

• The standardized difference in anchor score means was equal to the standardized
difference in the test means throughout the study. The equality of test and anchor
standardized mean differences was an operationalized definition of no difficulty
differences across test forms X and Y. In other words, no systematic difficulty effects
were built into (24) and no standardized mean differences on test forms X and Y were
present that could not also be observed in the standardized mean differences on the
anchor.

• The data generation model corresponded to the chained linear method’s observed
score focus. Note that the expression of the chained linear equating function (15)
based on this study’s data generation model’s constraints resulted in the identity
equating function, the equating function that would be appropriate when test forms
have no true difficulty differences:

$$cl_Y(X_P) = \frac{\sigma(Y_Q)\,\sigma(A_P)}{\sigma(A_Q)\,\sigma(X_P)}\,(X_P - \mu_{XP}) + \mu_{YQ}
+ \frac{\sigma(Y_Q)}{\sigma(A_Q)}\,(\mu_{AP} - \mu_{AQ})$$

$$= (X_P - \mu_{XP}) + \mu_{YQ} + \frac{\sigma(Y_Q)}{\sigma(A_Q)}\,(\mu_{AP} - \mu_{AQ}),$$

because the test variances were kept equal and the anchor variances were also kept equal, and

$$= X_P + \mu_{YQ} - \mu_{XP} + \frac{\sigma(Y_Q)}{\sigma(A_Q)}\,(\mu_{AP} - \mu_{AQ}) = X_P,$$

because

$$\frac{\mu_{YQ} - \mu_{XP}}{\sigma(Y_Q)} = \frac{\mu_{AQ} - \mu_{AP}}{\sigma(A_Q)}.$$
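The identity result can be checked numerically by evaluating the chained linear function (15) at the population means and standard deviations listed in Table 2. The sketch uses the .15 standardized ability difference condition; the function name is a hypothetical label.

```python
def chained_linear(x, mean_xp, sd_xp, mean_ap, sd_ap, mean_aq, sd_aq, mean_yq, sd_yq):
    """Equation (15), evaluated with population means and standard deviations."""
    slope = (sd_yq * sd_ap) / (sd_aq * sd_xp)
    shift = sd_yq / sd_aq * (mean_ap - mean_aq)
    return slope * (x - mean_xp) + mean_yq + shift


# Population values from Table 2, standardized ability difference = .15.
population = dict(
    mean_xp=51.350, sd_xp=18.0,
    mean_ap=15.405, sd_ap=5.4,
    mean_aq=14.595, sd_aq=5.4,
    mean_yq=48.650, sd_yq=18.0,
)
for x in (0, 50, 100):
    print(x, round(chained_linear(x, **population), 6))
```

Each score maps back to itself up to floating point error, because the 2.7-point difference in test means is exactly offset by the scaled anchor mean difference.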




Table 2

Data Collection Design (Nonequivalent Groups With an Anchor Test Design), Equation 24,
Where Score = T + E

Population   Score   n     $\mu$ (= $\varepsilon(T)/n$)   Score mean (= $n\mu$)   Score SD^a

Population standardized ability difference = 0
P            X       100   .5000                          50.000                  18.0
P            A       30    .5000                          15.000                  5.4
Q            A       30    .5000                          15.000                  5.4
Q            Y       100   .5000                          50.000                  18.0

Population standardized ability difference = .15
P            X       100   .5135                          51.350                  18.0
P            A       30    .5135                          15.405                  5.4
Q            A       30    .4865                          14.595                  5.4
Q            Y       100   .4865                          48.650                  18.0

Population standardized ability difference = .30
P            X       100   .5270                          52.700                  18.0
P            A       30    .5270                          15.810                  5.4
Q            A       30    .4730                          14.190                  5.4
Q            Y       100   .4730                          47.300                  18.0

^a The score standard deviation is determined as $\sqrt{\sigma^2(T) + \sigma^2(E)}$, where
$\sigma^2(T)$ and $\sigma^2(E)$ are determined to obtain a desired reliability and observed score
standard deviation.
Manipulating test and anchor reliabilities independently of test and anchor lengths and of
observed score variances was inconsistent with the assumptions of
generalizability theory and the classical congeneric models. This inconsistency was deliberately
created to set up situations where the Levine equating method using the classical congeneric-based
Angoff (1953) reliability estimates could be studied when the Angoff reliability estimates
were incorrect.

Population Standardized Ability Differences 

Three population standardized ability difference conditions were defined as standardized
mean differences on the anchor scores for P and Q. This study considered population
standardized mean differences of 0, .15, and .30. For nonzero population standardized mean
differences, P was more able than Q.

Reliability 

Five combinations of test and anchor reliabilities are presented in Table 3, where
reliabilities ranged from high (.9 and .8) through medium (.7 and .6) to very low (.5 and .4). The
reliabilities of anchors $A_P$ and $A_Q$ were always equal and always less than the reliabilities of
tests $X_P$ and $Y_Q$. Two reliability combinations were such that the reliabilities of $X_P$ and
$Y_Q$ were unequal (reliability combinations of
$\mathrm{rel}_{XP}$_$\mathrm{rel}_{AP}$_$\mathrm{rel}_{AQ}$_$\mathrm{rel}_{YQ}$ = .9_.5_.5_.7 and .7_.5_.5_.9).
Unequal reliabilities among total tests mean that, technically, equating cannot be done. This
study’s references to equating method results when the tests have unequal reliabilities are
intended as descriptions for the performance of the equating methods under conditions where
adequate equating is impossible. Holding the observed standard deviations for the test and
anchors constant but varying reliability produced a situation where the Angoff (1953) reliability
estimates were not accurate. Their extent of inaccuracy for the reliability conditions and
observed score standard deviations in this simulation is shown in Table 4.
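The size of this inaccuracy can be sketched at the population level. The report does not print the Angoff formulas, so the code below assumes the usual external-anchor expressions derived from the classical congeneric model (as presented, for example, in Kolen & Brennan, 2004), and it takes the test-anchor covariance to be the one implied by perfectly correlated true scores and independent errors. Function names are hypothetical.

```python
import math


def population_covariance(sd_test, sd_anchor, rel_test, rel_anchor):
    """Covariance implied by perfectly correlated true scores and independent errors."""
    return math.sqrt(rel_test * rel_anchor) * sd_test * sd_anchor


def angoff_reliabilities(sd_test, sd_anchor, cov):
    """Assumed external-anchor Angoff estimates under the classical congeneric model."""
    gamma = (sd_test ** 2 + cov) / (sd_anchor ** 2 + cov)  # ratio of effective lengths
    rel_test = gamma * cov / sd_test ** 2
    rel_anchor = cov / (gamma * sd_anchor ** 2)
    return rel_test, rel_anchor


# Correct (test, anchor) reliability pairs used in this study's conditions.
for rel_test, rel_anchor in ((0.9, 0.8), (0.9, 0.6), (0.7, 0.4), (0.9, 0.5), (0.7, 0.5)):
    cov = population_covariance(18.0, 5.4, rel_test, rel_anchor)
    est_test, est_anchor = angoff_reliabilities(18.0, 5.4, cov)
    print(rel_test, rel_anchor, round(est_test, 2), round(est_anchor, 2))
```

Under these assumptions the estimates depart from the correct reliabilities by a few hundredths, with the test overestimated and the anchor underestimated or the reverse depending on the combination, which appears to agree with the parenthesized values reported in Table 4.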

Sample Size 

Three sample size conditions were considered for P and Q: $N_P = N_Q$ = 500, 1,000, and 5,000.

Equating Methods

Ten NEAT equating methods were considered for equating $X_P$ to $Y_Q$ through anchors $A_P$
and $A_Q$. These are the linear methods described in the introduction (chained linear, Tucker, Levine
observed, and Levine true) and the equipercentile counterparts of the linear methods (chained
equipercentile with raw and smoothed frequencies and frequency estimation equipercentile with
raw and smoothed frequencies). Loglinear smoothing (Holland & Thayer, 1987, 2000) was used
with the equipercentile methods to preserve four moments on the test and anchor distributions
and one cross-product moment between the tests and anchors. The Levine observed and true
equating methods were considered using Angoff (1953) reliability estimates for a test and an
external anchor and also using the correct reliabilities (i.e., the reliabilities by which the data
were actually generated).

Table 3

Reliability Levels in Two Test Scores and Two Anchor Scores

Combination   $X_P$   $A_P$   $A_Q$   $Y_Q$
1             .9      .8      .8      .9
2             .9      .6      .6      .9
3             .7      .4      .4      .7
4             .9      .5      .5      .7
5             .7      .5      .5      .9

Table 4

Correct and Angoff-Estimated Reliabilities for the Conditions of This Study

Combination   $X_P$        $A_P$        $A_Q$        $Y_Q$
1             .9 (.93)     .8 (.78)     .8 (.78)     .9 (.93)
2             .9 (.87)     .6 (.62)     .6 (.62)     .9 (.87)
3             .7 (.74)     .4 (.38)     .4 (.38)     .7 (.74)
4             .9 (.83)     .5 (.54)     .5 (.45)     .7 (.78)
5             .7 (.78)     .5 (.45)     .5 (.54)     .9 (.83)


Simulation 

For the simulation, 200 random datasets of Xp, Ap, Aq, and Yq scores were generated for 
particular combinations of sample size, reliability, and population standardized ability 
differences. The 10 equating methods were used to equate Xp to Yq in each of these 200 datasets. 
For each possible score on Xp (0-100), averages and standard deviations of the 200 equated 
scores were computed for each equating method. These equating method averages and standard 
deviations (i.e., empirical standard errors) were then analyzed across equating method, sample 
size, reliability combination, and population standardized ability difference. 
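
The replication loop described above can be sketched in a few lines of Python. The sketch is illustrative only: the data generation step is an assumed, simplified classical true score model (not the study's own generation model), scores are left continuous and unbounded, only the chained linear method is shown, and the anchor mean of 15 is a hypothetical value. The standard deviations of 18 and 5.4 are those quoted later in this report.

```python
import random
import statistics


def simulate_group(n, ability_mean, rel_test, rel_anchor, rng,
                   test_mean=50.0, test_sd=18.0,
                   anchor_mean=15.0, anchor_sd=5.4):
    """Draw n (test, anchor) score pairs that share one true ability.

    ASSUMPTION: simplified classical true score model, not the study's
    generation model; anchor_mean is a hypothetical value.
    """
    tests, anchors = [], []
    for _ in range(n):
        ability = rng.gauss(ability_mean, 1.0)
        # sqrt(rel) weights ability and sqrt(1 - rel) weights error, so the
        # within-group observed SD is fixed while reliability varies.
        tests.append(test_mean + test_sd * (
            rel_test ** 0.5 * ability
            + (1.0 - rel_test) ** 0.5 * rng.gauss(0.0, 1.0)))
        anchors.append(anchor_mean + anchor_sd * (
            rel_anchor ** 0.5 * ability
            + (1.0 - rel_anchor) ** 0.5 * rng.gauss(0.0, 1.0)))
    return tests, anchors


def chained_linear(x, xp, ap, aq, yq):
    """Chained linear equating: link X to A in P, then A to Y in Q."""
    a_equiv = statistics.mean(ap) + (
        statistics.stdev(ap) / statistics.stdev(xp)) * (x - statistics.mean(xp))
    return statistics.mean(yq) + (
        statistics.stdev(yq) / statistics.stdev(aq)) * (a_equiv - statistics.mean(aq))


def replicate(n_reps=200, n=500, ability_diff=0.0,
              rels=(0.9, 0.8, 0.8, 0.9), x=50.0, seed=1):
    """Return the average and SD (empirical standard error) of equated scores.

    rels is ordered as in Table 3: (Xp, Ap, Aq, Yq).
    """
    rng = random.Random(seed)
    equated = []
    for _ in range(n_reps):
        xp, ap = simulate_group(n, ability_diff, rels[0], rels[1], rng)
        yq, aq = simulate_group(n, 0.0, rels[3], rels[2], rng)
        equated.append(chained_linear(x, xp, ap, aq, yq))
    return statistics.mean(equated), statistics.stdev(equated)


if __name__ == "__main__":
    average, standard_error = replicate()
    print(f"average = {average:.2f}, empirical SE = {standard_error:.2f}")
```

Calling `replicate` once per cell of the design (sample size by reliability combination by ability difference) would yield the kind of averages and empirical standard errors that are analyzed in the following sections.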

Evaluation of Results 

Analyses of variance (ANOVAs) and source mean squares were used to identify the 
strongest influences on equating method averages and standard errors. Specifically, the equating 
method averages and standard errors of converted scores at an Xp score of 50 were analyzed in 
ANOVAs with 15 effects: the 4 main effects (i.e., the 10 equating methods, 3 sample 
sizes, 5 reliability combinations, and 3 population standardized ability differences); 6 two-way 
interactions; 4 three-way interactions; and 1 four-way interaction. The percentages of total 
variance in these 15 effects gave a general indication of how each manipulated variable 
contributed to the variation in equating method averages and standard errors. The ANOVA 
results, like ANOVA results from any controlled study, directly reflect the levels of the variables 
considered in the study (which were selected because they spanned a range of situations 
encountered by this study’s authors in their equating work). Additional follow-up analyses for 
the equating method averages and standard errors were also conducted to describe the results not 
adequately captured by the ANOVA analyses. 

Results 

Equating Method Averages 

Table 5 presents the mean squares from the ANOVA of the 15 effects for the equating 
method averages for score Xp = 50. These mean squares are ranked in terms of their proportion 
of variance explained in equating method averages. Over 99% of the variation in equating 
method averages is attributable to the main and interaction effects of equating method and 
population standardized ability differences and the interaction of these two effects with reliability 
combinations (equating, equating x ability, equating x reliability, and equating x ability x 
reliability). The sample size effect on equating method averages was negligible. 

Table 5 

Ranked Mean Squares and Their Percentage of Total Variation in Equating Method Averages 

Source                                           DF    Mean square    % of total variance    Cumulative % variance
Equating                                          9       18.92               72                     72
Equating x ability                               18        6.36               24                     96
Equating x reliability                           36        0.62                2                     98
Equating x ability x reliability                 72        0.21                1                     99
Ability                                           2        0.15                1                    100
Reliability                                       4        0.03                0                    100
Ability x reliability                             8        0.03                0                    100
Ability x sample size                             4        0.02                0                    100
Ability x reliability x sample size              16        0.01                0                    100
Reliability x sample size                         8        0.01                0                    100
Sample size                                       2        0.00                0                    100
Equating x sample size                           18        0.00                0                    100
Equating x reliability x sample size             72        0.00                0                    100
Equating x ability x reliability x sample size  144        0.00                0                    100
Equating x ability x sample size                 36        0.00                0                    100
Total                                           449       26.35              100

Note. Xp = 50. 
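
The percentage columns of Table 5 can be reproduced, to rounding, by dividing each mean square by the sum of all 15 mean squares. The short Python sketch below does this with the values printed in the table. That the percentages were computed in exactly this way is an assumption inferred from the table, not a statement of the authors' procedure.

```python
# Mean squares copied from Table 5, in ranked order.
MEAN_SQUARES = [
    ("Equating", 18.92),
    ("Equating x ability", 6.36),
    ("Equating x reliability", 0.62),
    ("Equating x ability x reliability", 0.21),
    ("Ability", 0.15),
    ("Reliability", 0.03),
    ("Ability x reliability", 0.03),
    ("Ability x sample size", 0.02),
    ("Ability x reliability x sample size", 0.01),
    ("Reliability x sample size", 0.01),
    ("Sample size", 0.00),
    ("Equating x sample size", 0.00),
    ("Equating x reliability x sample size", 0.00),
    ("Equating x ability x reliability x sample size", 0.00),
    ("Equating x ability x sample size", 0.00),
]


def percent_of_total(rows):
    """Return (source, percent, cumulative percent) for each mean square.

    ASSUMPTION: percent = mean square / sum of mean squares.
    """
    total = sum(ms for _, ms in rows)
    shares, running = [], 0.0
    for source, ms in rows:
        pct = 100.0 * ms / total
        running += pct
        shares.append((source, pct, running))
    return shares


if __name__ == "__main__":
    for source, pct, cumulative in percent_of_total(MEAN_SQUARES)[:5]:
        print(f"{source:35s} {pct:5.1f} {cumulative:6.1f}")
```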


The influences of population standardized ability differences and reliability combination 
are illustrated in Table 6 (population standardized ability difference = 0), Table 7 (population 
standardized ability difference = .15), and Table 8 (population standardized ability difference = 
.30), which give the equating method averages for each equating method across the five 
reliability combinations averaged across all of the sample sizes. The equating method averages 
are essentially equal to the criterion equated score of 50 across reliability conditions when 
population abilities do not differ (Table 6). When abilities differ (Tables 7 and 8), the equating 
method averages become more dependent on equating method and also on reliability levels, so 
that the conditioning methods (Tucker, raw and smoothed frequency estimation equipercentile) 
give progressively lower equating method averages as reliability declines, the chained methods 
(chained linear, raw and smoothed chained equipercentile) change only slightly, and the Levine 
methods give progressively higher equating method averages as reliability declines. 
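
The opposite drifts of the three method families can be illustrated with the factor each linear method uses to scale the anchor mean difference. The Python sketch below uses the standard textbook forms of these factors (covariance over anchor variance for Tucker, the ratio of observed standard deviations for chained linear, and the ratio of true score standard deviations for Levine). These forms are assumed here rather than copied from this report's numbered equations, and the test-anchor correlation is assumed to equal the square root of the product of the two reliabilities, which holds when true scores are perfectly correlated and errors are uncorrelated.

```python
import math


def scaling_factors(rel_test, rel_anchor, sd_test=18.0, sd_anchor=5.4):
    """Factor applied to the anchor mean difference by each linear method.

    ASSUMPTION: textbook forms; observed test-anchor correlation is
    sqrt(rel_test * rel_anchor).
    """
    ratio = sd_test / sd_anchor
    corr = math.sqrt(rel_test * rel_anchor)
    return {
        "tucker": corr * ratio,    # cov(test, anchor) / var(anchor)
        "chained": ratio,          # sd(test) / sd(anchor)
        "levine": math.sqrt(rel_test / rel_anchor) * ratio,  # true score SD ratio
    }


if __name__ == "__main__":
    for rel_test, rel_anchor in [(0.9, 0.8), (0.9, 0.6), (0.7, 0.4)]:
        factors = scaling_factors(rel_test, rel_anchor)
        print(rel_test, rel_anchor,
              {name: round(value, 2) for name, value in factors.items()})
```

Under these assumptions the Tucker factor shrinks, the chained factor is unchanged, and the Levine factor grows as reliability declines, which mirrors the ordering of the averages in Tables 7 and 8.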


Table 6 

Equating Method Averages Across Reliability Combinations When P and Q Abilities Were 
Equal and Population Standardized Ability Difference = 0 

                                                        Reliability combination
Equating method                                 .9_.8_.8_.9  .9_.6_.6_.9  .7_.4_.4_.7  .9_.5_.5_.7  .7_.5_.5_.9
Tucker                                             50.02        50.02        49.99        50.04        50.05
Raw frequency estimation equipercentile            50.02        50.03        49.96        50.06        50.06
Smoothed frequency estimation equipercentile       50.01        50.02        49.99        50.04        50.05
Chained linear                                     50.01        50.02        50.00        50.05        50.05
Raw chained equipercentile                         50.03        50.00        49.94        50.12        50.09
Smoothed chained equipercentile                    50.00        50.00        49.99        50.08        50.04
Levine observed-correct reliabilities              50.01        50.02        50.00        50.06        50.05
Levine observed-Angoff reliabilities               50.01        50.02        50.00        50.06        50.05
Levine true-correct reliabilities                  50.01        50.01        50.00        50.06        50.04
Levine true-Angoff reliabilities                   50.01        50.02        50.00        50.06        50.05

Note. Xp = 50. 

Table 7 

Equating Method Averages Across Reliability Combinations When P and Q Abilities Were 
Unequal and Population Standardized Ability Difference = .15 

                                                        Reliability combination
Equating method                                 .9_.8_.8_.9  .9_.6_.6_.9  .7_.4_.4_.7  .9_.5_.5_.7  .7_.5_.5_.9
Tucker                                             49.60        49.27        48.75        48.98        49.00
Raw frequency estimation equipercentile            49.62        49.29        48.75        48.96        48.99
Smoothed frequency estimation equipercentile       49.63        49.28        48.80        48.99        49.04
Chained linear                                     50.00        49.99        49.96        49.95        49.99
Raw chained equipercentile                         50.01        50.00        49.98        49.97        49.97
Smoothed chained equipercentile                    50.02        49.96        49.99        49.93        49.99
Levine observed-correct reliabilities              50.16        50.59        50.80        50.64        50.69
Levine observed-Angoff reliabilities               50.25        50.47        50.99        50.68        50.74
Levine true-correct reliabilities                  50.16        50.59        50.78        50.58        50.71
Levine true-Angoff reliabilities                   50.25        50.47        50.99        50.70        50.71

Note. Xp = 50. 

Table 8 

Equating Method Averages Across Reliability Combinations When P and Q Abilities Were 
Unequal and Population Standardized Ability Difference = .30 

                                                        Reliability combination
Equating method                                 .9_.8_.8_.9  .9_.6_.6_.9  .7_.4_.4_.7  .9_.5_.5_.7  .7_.5_.5_.9
Tucker                                             49.17        48.58        47.44        48.00        48.02
Raw frequency estimation equipercentile            49.19        48.58        47.49        48.00        48.04
Smoothed frequency estimation equipercentile       49.20        48.61        47.51        48.06        48.08
Chained linear                                     49.99        50.02        49.96        49.97        50.00
Raw chained equipercentile                         50.00        49.97        49.99        49.95        50.00
Smoothed chained equipercentile                    49.98        49.99        49.96        49.95        49.99
Levine observed-correct reliabilities              50.32        51.24        51.68        51.37        51.41
Levine observed-Angoff reliabilities               50.49        51.00        52.08        51.47        51.50
Levine true-correct reliabilities                  50.30        51.21        51.67        51.25        51.44
Levine true-Angoff reliabilities                   50.49        51.00        52.09        51.51        51.45

Note. Xp = 50. 


Two additional important findings could not be observed in the variability of the 
equating method averages at Xp scores of 50. The averages for the Levine true equating method 
were substantially influenced by an interaction between the reliability estimation method and 
reliability levels. Figure 1 plots the difference between the averages of the Levine true with 
correct reliabilities equating method and the identity function for the population standardized 
ability difference of 0 and the two reliability combinations that featured different test 
reliabilities. The slope for the Levine true with correct reliabilities equating method changed 
considerably, decreasing when Yq's reliability was smaller than Xp's reliability (e.g., reliability 
combination .9_.5_.5_.7) but increasing when Yq's reliability was larger than Xp's reliability 
(e.g., reliability combination .7_.5_.5_.9). Figure 2 plots the difference between the averages of 
the Levine true with Angoff reliabilities equating method and the identity function. The slopes 
of the equating functions shown in Figure 2 were opposite in sign and of relatively smaller 
magnitude than those of Figure 1. 



Figure 1. Levine true with correct reliabilities equating method averages minus identity, 
population standardized ability difference = 0. Plotted series: reliability combinations 
.9_.5_.5_.7 and .7_.5_.5_.9. 



Figure 2. Levine true with Angoff reliabilities equating method averages minus identity, 
population standardized ability difference = 0. 

Equating Method Standard Errors 

Table 9 presents the mean squares from the ANOVA of the 15 effects for the equating 
method standard errors for score Xp = 50. These mean squares are ranked in terms of their 
proportion of variance explained in equating method standard errors. The strongest effects are 
sample size (90%), followed by reliability (7%), and equating method (2%). These three effects 
accounted for about 99% of the total variance in equating method standard errors. Figures 3-32 
plot the standard errors conditional on scores across the five reliability combinations for each 
equating method and sample size condition when population standardized ability differences = 0. 
These plots show the increase in equating method standard errors as sample sizes decrease and 
the relatively smaller increase in equating method standard errors as reliabilities 
decrease. The conditioning methods (Tucker, raw and smoothed frequency estimation 
equipercentile) in Figures 3-11 had equating method standard errors that were smaller and less 
responsive to reliability changes than the chained methods (Figures 12-20). The Levine methods 
in Figures 21-32 had relatively large equating method standard errors that were strongly 
influenced by reliability changes. The series of equating method standard errors are U-shaped for 
the linear methods and dog-bone-shaped for the equipercentile methods. Finally, the figures also 
show an interactive effect of reliability and sample size, which accounted for about 1% of the 
variation in equating method standard errors. The reliability x sample size interaction is that 
reliability had a much more visible effect on equating method standard errors for smaller sample 
sizes than for larger sample sizes. 


Discussion 

The purpose of this study was to evaluate the impact of reliability on test equating 
methods used in the NEAT design. An essential part of this evaluation was a description of 
reliability’s interaction with the influence of population ability differences on anchor means. 
Two test score models were summarized and compared in terms of their assumptions about the 
contribution of reliability and examinee ability on observed scores. The implicit assumptions of 
different equating methods for addressing reliability and ability differences were related to the 
assumptions made by different test score models, so any equating method might be inaccurate 
when test scores are not perfectly reliable, populations differ in ability, and the equating method 
incorrectly specifies the reliability-ability difference interaction. A simulation was conducted to 
illustrate the influence of reliability on several equating methods across levels of population 
ability difference, anchor and test reliability levels, and sample size. 


Table 9 

Ranked Mean Squares and Their Percentage of Total Variation in Equating Method Standard 
Errors 

Source                                           DF    Mean square    % of total variance    Cumulative % variance
Sample size                                       2       21.17               90                     90
Reliability                                       4        1.62                7                     96
Equating                                          9        0.55                2                     99
Reliability x sample size                         8        0.17                1                     99
Equating x sample size                           18        0.06                0                    100
Ability x reliability x sample size              16        0.02                0                    100
Ability x sample size                             4        0.02                0                    100
Ability x reliability                             8        0.02                0                    100
Equating x reliability                           36        0.01                0                    100
Ability                                           2        0.01                0                    100
Equating x reliability x sample size             72        0.00                0                    100
Equating x ability x reliability                 72        0.00                0                    100
Equating x ability x reliability x sample size  144        0.00                0                    100
Equating x ability                               18        0.00                0                    100
Equating x ability x sample size                 36        0.00                0                    100
Total                                           449       23.64              100

Note. Xp = 50. 

[Figures 3-32: each plot shows one series per reliability combination (rel_9_8_8_9, rel_9_6_6_9, 
rel_7_4_4_7, rel_9_5_5_7, rel_7_5_5_9).] 

Figure 3. Tucker equating method standard errors, population standardized ability 
difference = 0, Np = Nq = 500. 

Figure 4. Tucker equating method standard errors, population standardized ability 
difference = 0, Np = Nq = 1,000. 

Figure 5. Tucker equating method standard errors, population standardized ability 
difference = 0, Np = Nq = 5,000. 

Figure 6. Raw frequency estimation equipercentile equating method standard errors, 
population standardized ability difference = 0, Np = Nq = 500. 

Figure 7. Raw frequency estimation equipercentile equating method standard errors, 
population standardized ability difference = 0, Np = Nq = 1,000. 

Figure 8. Raw frequency estimation equipercentile equating method standard errors, 
population standardized ability difference = 0, Np = Nq = 5,000. 

Figure 9. Smoothed frequency estimation equipercentile equating method standard errors, 
population standardized ability difference = 0, Np = Nq = 500. 

Figure 10. Smoothed frequency estimation equipercentile equating method standard 
errors, population standardized ability difference = 0, Np = Nq = 1,000. 

Figure 11. Smoothed frequency estimation equipercentile equating method standard 
errors, population standardized ability difference = 0, Np = Nq = 5,000. 

Figure 12. Chained linear equating method standard errors, population standardized 
ability difference = 0, Np = Nq = 500. 

Figure 13. Chained linear equating method standard errors, population standardized 
ability difference = 0, Np = Nq = 1,000. 

Figure 14. Chained linear equating method standard errors, population standardized 
ability difference = 0, Np = Nq = 5,000. 

Figure 15. Raw chained equipercentile equating method standard errors, population 
standardized ability difference = 0, Np = Nq = 500. 

Figure 16. Raw chained equipercentile equating method standard errors, population 
standardized ability difference = 0, Np = Nq = 1,000. 

Figure 17. Raw chained equipercentile equating method standard errors, population 
standardized ability difference = 0, Np = Nq = 5,000. 

Figure 18. Smoothed chained equipercentile equating method standard errors, population 
standardized ability difference = 0, Np = Nq = 500. 

Figure 19. Smoothed chained equipercentile equating method standard errors, population 
standardized ability difference = 0, Np = Nq = 1,000. 

Figure 20. Smoothed chained equipercentile equating method standard errors, population 
standardized ability difference = 0, Np = Nq = 5,000. 

Figure 21. Levine observed with correct reliabilities equating method standard errors, 
population standardized ability difference = 0, Np = Nq = 500. 

Figure 22. Levine observed with correct reliabilities equating method standard errors, 
population standardized ability difference = 0, Np = Nq = 1,000. 

Figure 23. Levine observed with correct reliabilities equating method standard errors, 
population standardized ability difference = 0, Np = Nq = 5,000. 

Figure 24. Levine observed with Angoff reliabilities equating method standard errors, 
population standardized ability difference = 0, Np = Nq = 500. 

Figure 25. Levine observed with Angoff reliabilities equating method standard errors, 
population standardized ability difference = 0, Np = Nq = 1,000. 

Figure 26. Levine observed with Angoff reliabilities equating method standard errors, 
population standardized ability difference = 0, Np = Nq = 5,000. 

Figure 27. Levine true with correct reliabilities equating method standard errors, 
population standardized ability difference = 0, Np = Nq = 500. 

Figure 28. Levine true with correct reliabilities equating method standard errors, 
population standardized ability difference = 0, Np = Nq = 1,000. 

Figure 29. Levine true with correct reliabilities equating method standard errors, 
population standardized ability difference = 0, Np = Nq = 5,000. 

Figure 30. Levine true with Angoff reliabilities equating method standard errors, 
population standardized ability difference = 0, Np = Nq = 500. 

Figure 31. Levine true with Angoff reliabilities equating method standard errors, 
population standardized ability difference = 0, Np = Nq = 1,000. 

Figure 32. Levine true with Angoff reliabilities equating method standard errors, 
population standardized ability difference = 0, Np = Nq = 5,000. 

The results of the simulation are consistent extensions of what is known about the 
performance of equating methods. When reliabilities become lower and abilities differ, the 
chained, conditioning, and Levine methods disagree more with each other. In terms of their 
average performance and the data generation model used in this study, Levine true and observed 
overmatched on ability differences relative to the chained methods, while the conditioning 
methods undermatched on ability differences (also described in Holland, 2004; Livingston, 2004; 
MacCann, 1990). The use of a different data generation model would not be expected to change 
the relative ordering of the equating methods' equated scores, though it would change each 
method's accuracy (see Wang et al., 2006, for a comparison based on an item response theory (IRT) 
data generation model where, like this study's results, the conditioning methods were more biased 
and less variable than the chained methods). 

Changes in reliability had a visible effect on equating method standard errors that was 
relatively small when compared to the effect of changes in sample size. In general, the equating 
methods that use the anchor as a conditioning variable tend to exhibit smaller standard errors 
than do chained and Levine methods (von Davier et al., 2004; von Davier & Kong, 2005; Kolen 
& Brennan, 2004). The results of this study’s simulations showed that the standard errors of the 
conditioning methods are less influenced by levels of reliability compared to the chained and 
Levine methods. The equipercentile equating functions (e.g., raw and smoothed frequency 
estimation equipercentile and chained equipercentile) are more variable than their linear 
counterparts (e.g., Tucker and chained linear), but they exhibit responses to reliability changes 
that are similar to their linear counterparts. 

Levine Results 

There were subtle, but understandable, results noted for the Levine methods from the 
simulation. Levine true's slope varied much more than the slopes of other equating methods 
when the test reliabilities differed because it was the only considered equating method that built 
reliability into its slope (16). As test reliabilities differed but observed score variances remained 
constant, scaling in terms of true score variability was very different from scaling in terms of 
observed score variability. Varying reliabilities while holding observed score variances constant 
is an unrealistic feature of this study's generation model; however, it is a potential explanation 
for the large difference in the Levine true method's slope relative to other equating methods' 
slopes (a phenomenon that is often observed and mulled over in equating practice). The more 
sophisticated description for this difference is that true score methods such as the Levine true 
method are built to satisfy requirements such as second-order equity (i.e., the error variances of 
$e_Y(x)$ and Y are equal at given true scores), and these methods can produce very different results 
from observed score methods that are built to match observed score variances (Tong & Kolen, 
2005). 


The incorporation of Angoff (1953) reliability estimates with the Levine methods had 
important effects on the slope for the Levine true method. Angoff reliability estimates are based 
on the classical congeneric model's assumptions of perfectly correlated anchor and test true 
scores and effective test lengths. The assumptions about variances being proportional to 
reliability were not closely followed in this study, where observed score variances were held 
constant as reliabilities were altered. Table 4 shows this study's generated (correct) reliabilities 
and Angoff estimates for the five reliability combinations. When a more reliable (Xp = .9) test is 
equated to a less reliable (Yq = .7) test and the test and anchor observed score standard deviations 
of this study are used (Table 2), the Levine true function's slope (from 17) is 

\[
\frac{\sqrt{.7}\,(18)\,\sqrt{.5}\,(5.4)}{\sqrt{.5}\,(5.4)\,\sqrt{.9}\,(18)}
= \frac{\sqrt{.7}\,(18)}{\sqrt{.9}\,(18)}
= \frac{\sqrt{.7}}{\sqrt{.9}}
= .88
\]

using correct reliabilities (Figure 1) and 

\[
\frac{\sqrt{.78}\,(18)\,\sqrt{.54}\,(5.4)}{\sqrt{.45}\,(5.4)\,\sqrt{.83}\,(18)}
= \frac{\sqrt{.78}\,\sqrt{.54}}{\sqrt{.45}\,\sqrt{.83}}
= 1.06
\]

using Angoff reliabilities (Figure 2). The reversed and smaller magnitude slopes of the Levine 
true method with Angoff reliabilities rather than correct reliabilities are directly attributable to 
the extent of inaccuracy in the Angoff reliabilities. 
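
The two slope values can be checked numerically. The Python sketch below evaluates the same ratio for the .9_.5_.5_.7 combination with the correct and the Angoff-estimated reliabilities from Table 4. The function name and argument order are illustrative choices, and the standard deviations of 18 and 5.4 are the values used in the computation above.

```python
import math


def levine_true_slope(rel_x, rel_ap, rel_aq, rel_y,
                      sd_test=18.0, sd_anchor=5.4):
    """Levine true slope for tests and anchors with equal observed SDs.

    Arguments follow the Table 4 column order: Xp, Ap, Aq, Yq.
    """
    numerator = math.sqrt(rel_y) * sd_test * math.sqrt(rel_ap) * sd_anchor
    denominator = math.sqrt(rel_aq) * sd_anchor * math.sqrt(rel_x) * sd_test
    return numerator / denominator


if __name__ == "__main__":
    correct = levine_true_slope(0.9, 0.5, 0.5, 0.7)
    angoff = levine_true_slope(0.83, 0.54, 0.45, 0.78)
    print(round(correct, 2), round(angoff, 2))  # prints: 0.88 1.06
```

Swapping in the .7_.5_.5_.9 row of Table 4 gives slopes on the opposite sides of 1, which is the reversal visible when Figures 1 and 2 are compared.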

Equating method standard errors were affected by whether correct reliabilities were over- 
or underestimated by the Angoff (1953) reliabilities (Table 4). Levine observed's equating 
functions were generally more variable with Angoff reliabilities than with correct reliabilities, 
except when both total tests' reliabilities were underestimated (the reliability combination of 
.9_.6_.6_.9). The Levine true method's equating functions became more variable when the 
Angoff reliability estimates were used and they overestimated Yq's reliability (reliability 
combinations of .9_.8_.8_.9, .7_.4_.4_.7, and .9_.5_.5_.7) and became less variable when 
Angoff estimates underestimated Yq's reliability (reliability combinations of .9_.6_.6_.9 and 
.7_.5_.5_.9). The reliability estimation method influences the variability of Levine equating 
functions through the extent of magnification in $\mu_{A_P} - \mu_{A_Q}$, setting 

\[
\gamma_{Lev} = \frac{\sqrt{rel_{Y_Q}}\,\sigma_{Y_Q}}{\sqrt{rel_{A_Q}}\,\sigma_{A_Q}}
\]

in (14) for (16) and (17). An underestimated Yq reliability and/or an overestimated Aq reliability 
resulted in less magnification of $\mu_{A_P} - \mu_{A_Q}$ and its sampling variability on the final 
Levine functions, whereas an overestimated Yq reliability and/or an underestimated Aq reliability 
resulted in more magnification of $\mu_{A_P} - \mu_{A_Q}$ and its sampling variability on the final 
Levine functions. 
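
The direction of these effects can be traced with the Table 4 values. The Python sketch below computes the magnification factor applied to the anchor mean difference from the Yq and Aq reliabilities, once with the correct values and once with the Angoff estimates. The form of the factor (ratio of true score standard deviations) is the standard Levine form and is an assumption here, as are the function names. The standard deviations of 18 and 5.4 are those used in the slope computation earlier in this section.

```python
import math

# (Yq, Aq) reliabilities from Table 4: correct values, then Angoff estimates.
TABLE_4 = {
    ".9_.8_.8_.9": ((0.9, 0.8), (0.93, 0.78)),
    ".9_.6_.6_.9": ((0.9, 0.6), (0.87, 0.62)),
    ".7_.4_.4_.7": ((0.7, 0.4), (0.74, 0.38)),
    ".9_.5_.5_.7": ((0.7, 0.5), (0.78, 0.45)),
    ".7_.5_.5_.9": ((0.9, 0.5), (0.83, 0.54)),
}


def magnification(rel_y, rel_aq, sd_y=18.0, sd_aq=5.4):
    """ASSUMED Levine factor: true score SD of Yq over true score SD of Aq."""
    return math.sqrt(rel_y) * sd_y / (math.sqrt(rel_aq) * sd_aq)


def angoff_effect(table=TABLE_4):
    """Label each combination by whether Angoff estimates magnify more or less."""
    effect = {}
    for combination, (correct, angoff) in table.items():
        more = magnification(*angoff) > magnification(*correct)
        effect[combination] = "more" if more else "less"
    return effect


if __name__ == "__main__":
    for combination, label in angoff_effect().items():
        print(combination, label)
```

Under these assumptions the combinations labeled "more" are the ones for which the Levine true functions were reported as more variable with Angoff reliabilities.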

Implications 

There are important implications for studying reliability as a relationship between score 
models and equating methods. When data are unreliable, the effect of population ability 
differences on test scores depends on the assumed score model, and different score models are 
compatible with some equating methods but not others. In unreliable data, an equating 
practitioner may have to make a nonempirical choice among models based on how reliability 
impacts the test scores, that is, whether unreliability reduces (the Tucker and frequency 
estimation methods), magnifies (the Levine method), or does not affect (the chained linear 
method) the extent to which examinee ability influences test scores. The major basis for this 
choice may be 
some interpretative evaluation of the quality of the anchor scores for estimating ability effects on 
test scores. 

Relationships between equating methods and score models not considered in this paper 
can potentially be understood in terms of this paper's discussion. Manipulating reliability in the 
classical congeneric model has an effect somewhat analogous to manipulating reliability 
in a two-parameter IRT model: reliability reductions reduce the extent to which 
examinee ability is visible on observed scores. For example, if reliability were reduced in a 
two-parameter logistic model through reducing the $a_i$ parameter in 

\[
P(X_{ti} = 1 \mid \theta_t, \beta_i, a_i) = \frac{\exp[a_i(\theta_t - \beta_i)]}{1 + \exp[a_i(\theta_t - \beta_i)]}, \tag{25}
\]

where $\theta_t$, $\beta_i$, and $a_i$ have their usual meanings as test-taker trait level and item 
difficulty and discrimination parameters, the result would be that the difference between 
test-taker ability and item difficulty would be less visible in the IRT-based observed item and test 
characteristic curves. Levine's magnification of anchor score mean differences may therefore be 
somewhat appropriate for IRT-generated data in the same way that Levine is appropriate for 
classical congeneric models. This suggestion has some support from results showing that IRT and 
Levine equating methods cluster together when there are population ability differences 
(Livingston, Dorans, & Wright, 1990), and other IRT-simulated results (Wang et al., 2006) show 
the same bias orderings between chained linear and conditioning equating methods described in the 
introduction. 
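
The flattening effect of a smaller discrimination parameter in (25) can be shown with a small numerical example. The Python sketch below evaluates the two-parameter logistic function for a lower-ability and a higher-ability test taker and reports the gap in their probabilities of a correct response. The trait levels of -1 and 1, the difficulty of 0, and the two discrimination values are arbitrary illustrative choices.

```python
import math


def p_correct(theta, beta, a):
    """Two-parameter logistic probability of a correct response, as in (25)."""
    z = a * (theta - beta)
    # 1 / (1 + exp(-z)) is algebraically equal to exp(z) / (1 + exp(z)).
    return 1.0 / (1.0 + math.exp(-z))


def ability_gap(a, beta=0.0, low=-1.0, high=1.0):
    """Difference in success probability between two hypothetical test takers."""
    return p_correct(high, beta, a) - p_correct(low, beta, a)


if __name__ == "__main__":
    for a in (1.5, 0.5):
        print(f"a = {a}: gap = {ability_gap(a):.3f}")
```

The gap shrinks as the discrimination parameter is reduced, which is the sense in which ability differences become less visible on observed scores.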

Another implication of this study is that more complicated interactions of reliability with 
test score characteristics can potentially be studied with respect to test equating through the use 
of more complex versions of the score models considered in this paper. This paper was 
concerned with the very simple case of tests and anchors with perfectly correlated true scores 
and examinee populations with no systematic subpopulations. In actual data, low and/or unequal 
reliability coincides with lack of population invariance (Dorans & Holland, 2000; Flanagan, 
1951; Holland, Liu, & Thayer, 2005; Kolen, 2004) and imperfectly correlated true scores. Group 
effects and construct differences could be built into many different score models, and then these 
effects could be studied in terms of their implications for equating. Such effects violate equating 
requirements other than the requirement of equal test reliabilities. The study of equating 
methods’ behavior with respect to combinations of equating requirement violations is an 
important way of relating degrees of equating violations to degrees of equating inadequacy. 